1. arXiv:2301.01874 [pdf, other] cs.CY cs.LG


Critical Perspectives: A Benchmark Revealing Pitfalls in PerspectiveAPI 

Authors: Lorena Piedras, Lucas Rosenblatt, Julia Wilkins 

Abstract: ...by proposing a new benchmark, Selected Adversarial SemanticS, or SASS. We evaluate PERSPECTIVE 
on SASS, and compare to low-effort alternatives, like zero-shot and few-shot GPT-3 prompt models, in binary 
classification settings. We find that PERSPECTIVE exhibits troubling shortcomings across a number of our toxicity 
categories. SASS provides a new tool for ev...

Submitted 4 January, 2023; originally announced January 2023. 

ACM Class: I.2.7

Journal ref: NLP for Positive Impact at EMNLP 2022 


2. arXiv:2301.01820 [pdf, ps, other] cs.AI


InPars-v2: Large Language Models as Efficient Dataset Generators for Information 
Retrieval 

Authors: Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, Rodrigo 
Nogueira 

Abstract: ...for documents. These synthetic query-document pairs can then be used to train a retriever. However, 
InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such 
datasets. In this work we introduce InPars-v2, a dataset generator that uses open-source LLMs and existing 
powerful rerankers to select synthetic query-do...

Submitted 4 January, 2023; originally announced January 2023. 


3. arXiv:2301.01764 [pdf, other]
UniHD at TSAR-2022 Shared Task: Is Compute All We Need for Lexical Simplification? 
Authors: Dennis Aumiller, Michael Gertz 


Abstract: ...requires deep technical knowledge and fine-tuned interaction to achieve its full potential. As an
alternative, we describe a frustratingly simple pipeline based on prompted GPT-3 responses, beating competing
approaches by a wide margin in settings with few training instances. Our best-performing submission to the
English language track of the TSAR-2022 share...


Submitted 5 January, 2023; v1 submitted 4 January, 2023; originally announced January 2023. 
Comments: Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022) at EMNLP 2022 


4. arXiv:2301.01215 [pdf, ps, other] math-ph 
Comment on 'The operational foundations of PT-symmetric and quasi-Hermitian
quantum theory'
Authors: Miloslav Znojil 
Abstract: ...the standard quantum theory) is not new. In a related comment on the author's method of proof 
performed in "the framework of general probabilistic theories" (GPT) we add that also in this context a few other,
mathematically consistent GPT-like theories are already available in the literature (pars pro tot...


Submitted 3 January, 2023; originally announced January 2023. 
Comments: 7 pp 


5. arXiv:2301.01181 [pdf] cs.CY 


Large Language Models as Corporate Lobbyists 

Authors: John J. Nay 

Abstract: ...the performance of the model, which outperforms the baseline of predicting the most common 
outcome of irrelevance. We also benchmark the performance of the previous OpenAI GPT-3 model
(text-davinci-002), which was state-of-the-art on many language tasks until text-davinci-003 was recently released. The
performance of text-davinci-002 is worse than simply alw...


Submitted 5 January, 2023; v1 submitted 3 January, 2023; originally announced January 2023. 


Comments: Our open-source code available here: https://github.com/JohnNay/llm-lobbyist. arXiv admin note: text overlap with
arXiv:2209.13020 


6. arXiv:2301.00774 [pdf, other] 
Massive Language Models Can Be Accurately Pruned in One-Shot 


Authors: Elias Frantar, Dan Alistarh 


Abstract: We show for the first time that large-scale generative pretrained transformer (GPT) family models can 
be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved 
via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive...


Submitted 2 January, 2023; originally announced January 2023. 


7. arXiv:2301.00665 [pdf, ps, other] cs.CR cs.LG
Targeted Phishing Campaigns using Large Scale Language Models 


Authors: Rabimba Karanjai 


Abstract: In this research, we aim to explore the potential of natural language models (NLMs) such as GPT-3 and 
GPT-2 to generate effective phishing emails. Phishing emails are fraudulent messages that aim to trick individuals 
into revealing sensitive information or taking actions that benefit the attackers. We propose a framewo...
Submitted 29 December, 2022; originally announced January 2023. 


8. arXiv:2301.00303 [pdf, other] cs.AI


Rethinking with Retrieval: Faithful Large Language Model Inference 
Authors: Hangfeng He, Hongming Zhang, Dan Roth 


Abstract: ...does not require additional training or fine-tuning and is not limited by the input length of LLMs. We 
evaluate the effectiveness of RR through extensive experiments with GPT-3 on three complex reasoning tasks: 
commonsense reasoning, temporal reasoning, and tabular reasoning. Our results show that RR can produce 
more faithful explanations and improve the per...

Submitted 31 December, 2022; originally announced January 2023. 


9. arXiv:2301.00184 [pdf, other] 
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? 




Authors: Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang 


Abstract: ...to help with existing text-video retrieval methods. To do so, we propose to use the zero-shot video 
captioner with knowledge of pre-trained web-scale models (e.g., CLIP and GPT-2) to generate captions for offline 
videos without any training. Given the captions, one question naturally arises: what can auxiliary captions do for 
text-video retrieval? In this pa...

Submitted 31 December, 2022; originally announced January 2023. 

Comments: Technical report 


10. arXiv:2212.14402 [pdf, other] cs.AI cs.LG


GPT Takes the Bar Exam 
Authors: Michael Bommarito II, Daniel Martin Katz 


Abstract: ...the state of the art in "AI?" In this research, we document our experimental evaluation of the
performance of OpenAI's 'text-davinci-003' model, often referred to as GPT-3.5, on the multistate multiple choice
(MBE) section of the exam. While we find no benefit in fine-tuning over...

Submitted 29 December, 2022; originally announced December 2022. 

Comments: Additional material available online at https://github.com/mjbommar/gpt-takes-the-bar-exam 


11. arXiv:2212.14206 [pdf] cs.IR cs.LG


Maximizing Use-Case Specificity through Precision Model Tuning 
Authors: Pranjali Awasthi, David Recio-Mitter, Yosuke Kyle Sugi 


Abstract: ...of the performance of four transformer-based language models on the task of biomedical information 
retrieval. The models we consider are DeepMind's RETRO (7B parameters), GPT-J (6B parameters), GPT-3 (175B 
parameters), and BLOOM (176B parameters). We compare their performance on the basis of relevance, accuracy, 
an...

Submitted 29 December, 2022; originally announced December 2022. 

Comments: 9 pages, 4 figures 

ACM Class: H.3.3 


12. arXiv:2212.14047 [pdf] cs.AI cs.HC


Using Large Language Models to Generate Engaging Captions for Data Visualizations 
Authors: Ashley Liew, Klaus Mueller 

Abstract: ...turns out that the key challenge lies in designing the most effective prompt for the LLM, a task called 
prompt engineering. We report on first experiments using the popular LLM GPT-3 and deliver some promising 
results.

Submitted 27 December, 2022; originally announced December 2022. 


13. arXiv:2212.13456 [pdf, other]


TegFormer: Topic-to-Essay Generation with Good Topic Coverage and High Text 
Coherence 


Authors: Wang Qi, Rui Liu, Yuan Zuo, Yong Chen, Dell Zhang 


Abstract: ...Moreover, an \emph{Embedding-Fusion} module that combines the domain-specific word 
embeddings learnt from the given corpus and the general-purpose word embeddings provided by a GPT-2 model 
pre-trained on massive text data is integrated into the decoder. Since GPT-2 is at a much larger scale, it contains a 
lot more imp...

Submitted 27 December, 2022; originally announced December 2022. 


14. arXiv:2212.13392 [pdf, other] cs.AI cs.CV cs.LG


DeepCuts: Single-Shot Interpretability based Pruning for BERT 

Authors: Jasdeep Singh Grover, Bhavesh Gawri, Ruskin Raj Manku 

Abstract: ...in parameters and layers, it has become much harder to train and infer with them on single GPUs. 
This is severely restricting the availability of large language models such as GPT-3, BERT-Large, and many others. A 
common technique to solve this problem is pruning the network architecture by removing transformer heads, 
fully-connected weights, and other modul...

Submitted 27 December, 2022; originally announced December 2022. 


Comments: 13 pages, 12 figures, 10 equations, initial preprint 


15. arXiv:2212.13196 [pdf]


Biologically Inspired Design Concept Generation Using Generative Pre-Trained 
Transformers 

Authors: Qihao Zhu, Xinyu Zhang, Jianxi Luo 

Abstract: ...language model (PLM) to automatically retrieve and map biological analogy and generate BID in the 
form of natural language. The latest generative pre-trained transformer, namely GPT-3, is used as the base PLM. 
Three types of design concept generators are identified and fine-tuned from the PLM according to the looseness 
of the problem space representation. Ma...

Submitted 26 December, 2022; originally announced December 2022. 

Comments: Accepted by J. Mech. Des. arXiv admin note: substantial text overlap with arXiv:2204.09714 


16. arXiv:2212.12411 [pdf, other] cs.LG


Benchmark for Uncertainty & Robustness in Self-Supervised Learning 

Authors: Ha Manh Bui, Iliana Maifeld-Carucci 

Abstract: ...models. In this paper, we explore variants of SSL methods, including Jigsaw Puzzles, Context, Rotation, 
Geometric Transformations Prediction for vision, as well as BERT and GPT for language tasks. We train SSL in 
auxiliary learning for vision and pre-training for language model, then evaluate the generalization (in-out 
classification accuracy) and uncertaint...

Submitted 23 December, 2022; originally announced December 2022. 


Comments: 15 pages, 3 tables, 6 figures, the class project in CSCI 601.771: Self-supervised Statistical Models - Johns Hopkins 
University - Fall 2022 


17. arXiv:2212.12131 [pdf, other]


Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer 
Fit to Human Reading Times? 

Authors: Byung-Doh Oh, William Schuler 

Abstract: ...times. First, regression analyses show a strictly monotonic, positive log-linear relationship between 
perplexity and fit to reading times for the more recently released five GPT-Neo variants and eight OPT variants on 
two separate datasets, replicating earlier results limited to just GPT-2 (Oh et al., 2022). Subsequent...
Submitted 22 December, 2022; originally announced December 2022. 

Comments: Transactions of the Association for Computational Linguistics (pre-MIT Press publication version) 


18. arXiv:2212.11275 [pdf, other] cs.AI cs.LG


KL Regularized Normalization Framework for Low Resource Tasks 

Authors: Neeraj Kumar, Ankur Narang, Brejesh Lall 

Abstract: Large pre-trained models, such as Bert, GPT, and Wav2Vec, have demonstrated great potential for 
learning representations that are transferable to a wide variety of downstream tasks. It is difficult to obtain a
large quantity of supervised data due to the limited availability of resources and time. In light of this, a significant
amount of research has been...

Submitted 21 December, 2022; originally announced December 2022. 

Comments: arXiv admin note: text overlap with arXiv:2106.05469 by other authors 


19. arXiv:2212.11185 [pdf, other]


Entropy- and Distance-Based Predictors From GPT-2 Attention Patterns Predict Reading 
Times Over and Above GPT-2 Surprisal 

Authors: Byung-Doh Oh, William Schuler 

Abstract: ...attention weights, we also experiment with alternative methods for incorporating vector norms into 
attention weights. Regression experiments using predictors calculated from the GPT-2 language model show that 
these predictors deliver a substantially better fit to held-out self-paced reading and eye-tracking data over a 
rigorous baseline including...


Submitted 21 December, 2022; originally announced December 2022. 
Comments: EMNLP 2022 


20. arXiv:2212.11118 [pdf, other] cs.LG cs.NE
NP4G : Network Programming for Generalization

Authors: Shoichiro Hara, Yuji Watanabe 

Abstract: ...programming has been actively studied for a long time by various approaches including genetic 
programming. In recent years, automatic programming using neural networks such as GPT-3 has been actively 
studied and is attracting a lot of attention. However, these methods are illogical inference based on experience by 
enormous learning, and their thinking proces...

Submitted 8 December, 2022; originally announced December 2022. 


21. arXiv:2212.10755 [pdf, other]


JASMINE: Arabic GPT Models for Few-Shot Learning 


Authors: El Moatez Billah Nagoudi, Muhammad Abdul-Mageed, AbdelRahim Elmadany, Alcides Alcoba Inciarte, 
Md Tawkat Islam Khondaker 

Abstract: Task agnostic generative pretraining (GPT) has recently proved promising for zero- and few-shot 
learning, gradually diverting attention from the expensive supervised learning paradigm. Although the community 
is accumulating knowledge as to capabilities of English-language autoregressive models such as GPT-3 adopting 
th...

Submitted 20 December, 2022; originally announced December 2022. 


22. arXiv:2212.10678 [pdf, other] cs.LG


Understanding Stereotypes in Language Models: Towards Robust Measurement and 
Zero-Shot Debiasing 

Authors: Justus Mattern, Zhijing Jin, Mrinmaya Sachan, Rada Mihalcea, Bernhard Schölkopf

Abstract: ...Accordingly, we propose a new framework for robustly measuring and quantifying biases exhibited by 
generative language models. Finally, we use this framework to investigate GPT-3's occupational gender bias and 
propose prompting techniques for mitigating these biases without the need for fine-tuning.

Submitted 20 December, 2022; originally announced December 2022. 


23. arXiv:2212.10559 [pdf, other]


Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as 
Meta-Optimizers 
Authors: Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Zhifang Sui, Furu Wei 


Abstract: ...finetuning. Theoretically, we figure out that the Transformer attention has a dual form of gradient 
descent based optimization. On top of it, we understand ICL as follows: GPT first produces meta-gradients 
according to the demonstration examples, and then these meta-gradients are applied to the original GPT to build 
an...

Submitted 21 December, 2022; v1 submitted 20 December, 2022; originally announced December 2022. 


24. arXiv:2212.10555 [pdf, other] cs.AI cs.LG


PairReranker: Pairwise Reranking for Natural Language Generation 
Authors: Dongfu Jiang, Bill Yuchen Lin, Xiang Ren 


Abstract: ...flexibility of \textsc{PairReranker}, showing strong results, compared with previous baselines. In 
addition, our \textsc{PairReranker} can generalize to significantly improve GPT-3 (text-davinci-003) results (e.g., 
24.55\% on CommonGen and 11.35\% on WMT18 zh-en), even though our rerankers are not trained with any 
GPT-...

Submitted 20 December, 2022; originally announced December 2022. 

Comments: We will release our code and data at https://inklab.usc.edu/PairReranker 


25. arXiv:2212.10529 [pdf, other] cs.CL cs.AI cs.CY


Is GPT-3 a Psychopath? Evaluating Large Language Models from a Psychological 
Perspective 
Authors: Xingxuan Li, Yutong Li, Linlin Liu, Lidong Bing, Shafiq Joty 
Abstract: Are large language models (LLMs) like GPT-3 psychologically safe? In this work, we design unbiased
prompts to evaluate LLMs systematically from a psychological perspective. Firstly, we test the personality traits of 
three different LLMs with Short Dark Triad (SD-3) and Big Five Inventory (BFI). We find all of them show higher 
scores on SD-3 than the human av...

Submitted 20 December, 2022; originally announced December 2022. 


26. arXiv:2212.10474 [pdf, other]


ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language 
Models 

Authors: Jonas Belouadi, Steffen Eger 

Abstract: ... model, and fine-tune it on a large custom corpus of English and German quatrains annotated with our 
styles. We show that ByGPT5 outperforms other models such as mT5, ByT5, GPT-2 and ChatGPT, while also being 
more parameter efficient and performing favorably compared to humans. In addition, we analyze its runtime 
performance and introspect the model's und...

Submitted 20 December, 2022; originally announced December 2022. 

Comments: Preprint 


27. arXiv:2212.10467 [pdf, other]


Generic Temporal Reasoning with Differential Analysis and Explanation 

Authors: Yu Feng, Ben Zhou, Haoyu Wang, Helen Jin, Dan Roth 

Abstract: ...subtle contextual change will affect temporal relation distributions. To facilitate learning, TODAY also 
annotates human explanations. We show that existing models, including GPT-3, drop to random guessing on 
TODAY, suggesting that they heavily rely on spurious information rather than proper reasoning for temporal 
predictions. On the other hand, we show that...

Submitted 20 December, 2022; originally announced December 2022. 


28. arXiv:2212.10466 [pdf, other]


Controllable Text Generation with Language Constraints 

Authors: Howard Chen, Huihan Li, Danqi Chen, Karthik Narasimhan

Abstract: ...for straightforward evaluation while striking a balance between broad attribute-level and narrow 
lexical-level controls. We find that even state-of-the-art language models like GPT-3 fail often on this task, and 
propose a solution to leverage a language model's own internal knowledge to guide generation. Our method,
called CognacGen, first queries the la...

Submitted 20 December, 2022; originally announced December 2022. 


29. arXiv:2212.10461 [pdf, other]


Go-tuning: Improving Zero-shot Learning Abilities of Smaller Language Models 

Authors: Jingjing Xu, Qingxiu Dong, Hongyi Liu, Lei Li 

Abstract: With increasing scale, large language models demonstrate both quantitative improvement and new 
qualitative capabilities, especially as zero-shot learners, like GPT-3. However, these results rely heavily on delicate 
prompt design and large computation. In this work, we explore whether the strong zero-shot ability could be 
achieved at a smaller model scale wit...

Submitted 20 December, 2022; originally announced December 2022. 


30. arXiv:2212.10450 [pdf, other]


Is GPT-3 a Good Data Annotator? 

Authors: Bosheng Ding, Chengwei Qin, Linlin Liu, Lidong Bing, Shafiq Joty, Boyang Li 

Abstract: GPT-3 (Generative Pre-trained Transformer 3) is a large-scale autoregressive language model developed 
by OpenAI, which has demonstrated impressive few-shot performance on a wide range of natural language
processing (NLP) tasks. Hence, an intuitive application is to use it for data annotation. In this paper, we investigate
whether...

Submitted 20 December, 2022; originally announced December 2022. 


31. arXiv:2212.10264 [pdf, other] cs.LG cs.CL cs.SE 
ReCode: Robustness Evaluation of Code Generation Models


Authors: Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, Mingyue Shang, Varun Kumar, Samson 
Tan, Baishakhi Ray, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Dan Roth, Bing Xiang 
Abstract: ...on SOTA models using HumanEval, MBPP, as well as function completion tasks derived from them. 
Interesting observations include: better robustness for CodeGen over InCoder and GPT-J; models are most 
sensitive to syntax perturbations; more challenging robustness evaluation on MBPP over HumanEval.
Submitted 20 December, 2022; originally announced December 2022. 

Comments: Code and data available at https://github.com/amazon-science/recode 


32. arXiv:2212.10190 [pdf, other] cs.CL


Pay Attention to Your Tone: Introducing a New Dataset for Polite Language Rewrite 
Authors: Xun Wang, Tao Ge, Allen Mao, Yuki Li, Furu Wei, Si-Qing Chen 

Abstract: ...to rewrite with effort. To alleviate the human effort for efficient annotation, we first propose a novel 
annotation paradigm by a collaboration of human annotators and GPT-3.5 to annotate \textsc{PoliteRewrite}. The 
released dataset has 10K polite sentence rewrites annotated collaboratively by...

Submitted 20 December, 2022; originally announced December 2022. 


33. arXiv:2212.10154 [pdf, other] cs.AI cs.CY cs.LG


Human-Guided Fair Classification for Natural Language Processing 

Authors: Florian E. Dorner, Momchil Peychev, Nikola Konstantinov, Naman Goel, Elliott Ash, Martin Vechev 
Abstract: ...proposes novel methods for bridging this gap by discovering expressive and intuitive individual 
fairness specifications. We show how to leverage unsupervised style transfer and GPT-3's zero-shot capabilities to 
automatically generate expressive candidate pairs of semantically similar sentences that differ along sensitive 
attributes. We then validate the...

Submitted 20 December, 2022; originally announced December 2022. 

Comments: 31 pages, 1 figure 


34. arXiv:2212.10114 [pdf, other]


True Detective: A Challenging Benchmark for Deep Abductive Reasoning in Foundation
Models

Authors: Maksym Del, Mark Fishel 

Abstract: ...of detective puzzles. Each puzzle includes a multiple-choice question for evaluation sourced from the 
"5 Minute Mystery" platform. Our results show that state-of-the-art GPT models perform significantly worse than 
human solvers on this benchmark, with an accuracy of 28\% compared to 47\% for humans. This indicates that 
there is still a significant ga...

Submitted 20 December, 2022; originally announced December 2022. 

Comments: 4 pages, preprint 


35. arXiv:2212.10071 [pdf, other] cs.AI cs.LG


Large Language Models Are Reasoning Teachers 
Authors: Namgyu Ho, Laura Schmid, Se-Young Yun 


Abstract: ...(CoT) prompting can elicit models to solve complex reasoning tasks, step-by-step. However, the 
efficacy of prompt-based CoT methods is restricted to very large LMs such as GPT-3 (175B), thus limiting 
deployability. In this paper, we revisit the fine-tuning approach to enable complex reasoning in smaller LMs, 
optimized to efficiently perform a specific task....

Submitted 20 December, 2022; originally announced December 2022. 


36. arXiv:2212.10029 [pdf, other] cs.AI


Do language models have coherent mental models of everyday things? 
Authors: Yuling Gu, Bhavana Dalvi Mishra, Peter Clark 


Abstract: ...dataset consisting of 100 everyday things, their parts, and the relationships between these parts. We 
observe that state-of-the-art pre-trained language models (LMs) like GPT-3 and Macaw have fragments of 
knowledge about these entities, but they fail to produce consistent parts mental models. We propose a simple
extension to these LMs where we apply a constr...
Submitted 20 December, 2022; originally announced December 2022. 


37. arXiv:2212.10020 [pdf, other]


On the Blind Spots of Model-Based Evaluation Metrics for Text Generation 

Authors: Tianxing He, Jingyu Zhang, Tianle Wang, Sachin Kumar, Kyunghyun Cho, James Glass, Yulia Tsvetkov 
Abstract: ...insensitivities, biases, or even loopholes in existing metrics. For example, we find that BERTScore 
ignores truncation errors in summarization, and MAUVE (built on top of GPT-2) is insensitive to errors at the
beginning of generations. Further, we investigate the reasons behind these blind spots and suggest practical
workarounds for a more reliable evaluatio...

Submitted 20 December, 2022; originally announced December 2022. 


38. arXiv:2212.09746 [pdf, other]


Evaluating Human-Language Model Interaction 

Authors: Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines
Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E. Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony
Lee, Rishi Bommasani, Michael Bernstein, Percy Liang 

Abstract: ...We then design five tasks ranging from goal-oriented to open-ended to capture different forms of 
interaction. On four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21's J1-Jumbo), we find that
non-interactive performance does not always result in better human-LM interaction and that first-person and
third-party metrics can diverge, su...

Submitted 20 December, 2022; v1 submitted 19 December, 2022; originally announced December 2022. 


39. arXiv:2212.09739 [pdf, other]


LENS: A Learnable Evaluation Metric for Text Simplification 

Authors: Mounica Maddela, Yao Dou, David Heineman, Wei Xu 

Abstract: ...4K simplifications of 24 systems, and SIMPEVAL_2022, a challenging simplification benchmark 
consisting of over 1K human ratings of 360 simplifications including generations from GPT-3.5. Training on 
SIMPEVAL_ASSET, we present LENS, a Learnable Evaluation Metric for Text Simplification. Extensive empirical 
results show that LENS correlates better with human j...

Submitted 19 December, 2022; originally announced December 2022. 


40. arXiv:2212.09720 [pdf, other] cs.LG cs.NE


The case for 4-bit precision: k-bit Inference Scaling Laws 

Authors: Tim Dettmers, Luke Zettlemoyer 

Abstract: ...parameters to examine which quantization methods improve scaling for 3 to 8-bit precision at scales 
of 19M to 66B parameters across the LLM families BLOOM, OPT, NeoX/Pythia, and GPT-2. We find that it is 
challenging to improve the bit-level scaling trade-off, with the only improvements being the use of a small block 
size -- splitting the parameters into smal...

Submitted 19 December, 2022; originally announced December 2022. 


41. arXiv:2212.09656 [pdf, other] cs.CL cs.IR


Visconde: Multi-document QA with GPT-3 and Neural Reranking 

Authors: Jayr Pereira, Robson Fidalgo, Roberto Lotufo, Rodrigo Nogueira 

Abstract: This paper proposes a question-answering system that can answer questions whose supporting 
evidence is spread over multiple (potentially long) documents. The system, called Visconde, uses a three-step 
pipeline to perform the task: decompose, retrieve, and aggregate. The first step decomposes the question into 
simpler questions using a few-shot large language model (LLM). Then, a state-of-the-art s...

Submitted 19 December, 2022; originally announced December 2022. 


42. arXiv:2212.09246 [pdf, other]


I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation 
Authors: Chandra Bhagavatula, Jena D. Hwang, Doug Downey, Ronan Le Bras, Ximing Lu, Keisuke Sakaguchi,
Swabha Swayamdipta, Peter West, Yejin Choi 

Abstract: ...capabilities. Or is it? In this paper, we investigate the possibility of a seemingly impossible match: can 
smaller language models with dismal commonsense capabilities (i.e., GPT-2), ever win over models that are orders 
of magnitude larger and better (i.e., GPT-3), if the smaller models are powered with novel commons...
Submitted 18 December, 2022; originally announced December 2022. 


43. arXiv:2212.09196 [pdf, other] cs.CL


Emergent Analogical Reasoning in Large Language Models 

Authors: Taylor Webb, Keith J. Holyoak, Hongjing Lu 

Abstract: ...In human cognition, this capacity is closely tied to an ability to reason by analogy. Here, we performed
a direct comparison between human reasoners and a large language model (GPT-3) on a range of analogical
tasks, including a novel text-based matrix reasoning task closely modeled on Raven's Progressive Matrices. We
found that...

Submitted 18 December, 2022; originally announced December 2022. 


44. arXiv:2212.08979 [pdf, other] cs.CL cs.LG


Language model acceptability judgements are not always robust to context 
Authors: Koustuv Sinha, Jon Gauthier, Aaron Mueller, Kanishka Misra, Keren Fuentes, Roger Levy, Adina Williams 


Abstract: ...linguistic contexts. However, they are substantially unstable for contexts containing syntactic 
structures matching those in the critical test content. Among all tested models (GPT-2 and five variants of OPT), 
we significantly improve models' judgements by providing contexts with matching syntactic structures, and 
conversely significantly worsen them usi...

Submitted 17 December, 2022; originally announced December 2022. 


45. arXiv:2212.08607 [pdf, other] cs.AI cs.LG


MURMUR: Modular Multi-Step Reasoning for Semi-Structured Data-to-Text Generation 
Authors: Swarnadeep Saha, Xinyan Velocity Yu, Mohit Bansal, Ramakanth Pasunuru, Asli Celikyilmaz 

Abstract: ...obtains significant improvements over recent few-shot baselines like direct prompting and
chain-of-thought prompting, while also achieving comparable performance to fine-tuned GPT-2 on out-of-domain data.
Moreover, human evaluation shows that MURMUR generates highly faithful and correct reasoning paths that lead
to 26% more logically consistent summaries on...

Submitted 16 December, 2022; originally announced December 2022. 

Comments: 22 pages (9 figures, 18 tables) 
The Role of AI in Drug Discovery: Challenges, Opportunities, and Strategies

Authors: Alexandre Blanco-Gonzalez, Alfonso Cabezon, Alejandro Seco-Gonzalez, Daniel Conde-Torres, Paula 
Antelo-Riveiro, Angel Pineiro, Rebeca Garcia-Fandino 

Abstract: ...and opportunities for realizing its potential in this field. Note from the human-authors: This article 
was created to test the ability of ChatGPT, a chatbot based on the GPT-3.5 language model, to assist human 
authors in writing review articles. The text generated by the AI following our instructions (see Supporting
Information) was used as a starting poin...

Submitted 8 December, 2022; originally announced December 2022. 

Comments: 11 pages, 1 figure 
Combining Particle Tracking with Electromagnetic Radiation Showers: Merging GPT and
Geant4 with Visualization 
Authors: David H. Dowell, Munther Hindi, S. B. van der Geer, M. J. de Loos 


Abstract: ...are rarely included in the design and engineering of electron injectors. This work attempts to remedy 
this situation by combining two well-known and well-documented programs, GPT and Geant4, to track electrons
and their losses in an injector beamline. This paper describes a system of programs which simulates electron
paths and losses along the beamline. In a...
Submitted 15 December, 2022; originally announced December 2022. 


48. arXiv:2212.07981 [pdf, other] 
Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human 
Evaluation 
Authors: Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan, Ruilin Han, Simeng Han, Shafiq Joty, 
Chien-Sheng Wu, Caiming Xiong, Dragomir Radev 
Abstract: ...significant results. Furthermore, our findings have important implications for evaluating large 
language models (LLMs), as we show that LLMs adjusted by human feedback (e.g., GPT-3.5) may overfit 
unconstrained human evaluation, which is affected by the annotators' prior, input-agnostic preferences, calling 
for more robust, targeted evaluation methods.
Submitted 15 December, 2022; originally announced December 2022. 
49. arXiv:2212.07841 [pdf, other] cs.IR
MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense 
Retrievers 
Authors: Kun Zhou, Xiao Liu, Yeyun Gong, Wayne Xin Zhao, Daxin Jiang, Nan Duan, Ji-Rong Wen 
Abstract: ...semantic information of passages and relationships among them within the pre-training corpus. The 
third one can capture the knowledge beyond the corpus from external PLMs (e.g. GPT-2). Extensive experiments 
on several large-scale passage retrieval datasets have shown that our approach outperforms the previous
state-of-the-art dense retrieval methods. Our cod...
Submitted 15 December, 2022; originally announced December 2022. 
Comments: 16 pages 
50. arXiv:2212.07796 [pdf, other] cs.CV
CREPE: Can Vision-Language Foundation Models Reason Compositionally? 
Authors: Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, Ranjay Krishna 
Abstract: ...atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome 
scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find 
that model performance decreases consistently when novel compositions dominate the retrieval set, with 
Recall@1 dropping by up to 8%. For productivity,...
Submitted 13 December, 2022; originally announced December 2022. 